We will work through several exercises using tidycensus
to fetch, wrangle, and map census data.
Be sure to clone or download and unzip the workshop files from: https://github.com/dlab-berkeley/Census-Data-in-R
Then:
Open the folder with the workshop files
Double-click on the R Project file
Census-Data-in-R.Rproj
This should open RStudio - with the Files panel
displaying the workshop folder contents.
Double-click on the file Census-Data-in-R.Rmd to
follow along!
You can also click on the file Census-Data-in-R.html
in the Files tab to open the workshop tutorial in a web browser.
If you installed any of these packages a while ago (especially
tidycensus), it’s a good idea to install updates when you can (though not during the workshop, as things can break!).
# Uncomment this to install packages, if necessary.
# install.packages(c("here", "tidyverse", "sf", "leaflet", "mapview", "tigris", "tidycensus"))
library(here)
## here() starts at /Users/pattyf/Documents/Dlab/workshops/AY2022/Census-Data-in-R-Sp22
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.1.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
library(leaflet)
library(mapview)
## Warning: multiple methods tables found for 'crop'
## Warning: multiple methods tables found for 'extend'
library(tigris)
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
library(tidycensus)
##
## Attaching package: 'tidycensus'
## The following object is masked from 'package:tigris':
##
## fips_codes
These seven libraries should be loaded in your environment now.
# If you run this chunk, output from the "here" function should be visible below. This is your local directory path. We can use this to import files later on.
here()
## [1] "/Users/pattyf/Documents/Dlab/workshops/AY2022/Census-Data-in-R-Sp22"
You need a Census API key to programmatically fetch
census data.
Get it here (pretty quickly): https://api.census.gov/data/key_signup.html
The key will be sent to your email and you will need to click to activate it.
Keep the email with the key open for use in this workshop.
For more info on all available Census APIs see: https://www.census.gov/data/developers/data-sets.html
We’ll begin by fetching US Census data with the
tidycensus R package
The tidycensus package allows R users to quickly fetch data from a select subset of Census databases.
The key tidycensus functions we will use today are:
census_api_key: makes your Census API key available
to tidycensus
load_variables: retrieves a dataframe of available
census data variables
get_decennial: fetch census data for the most recent
decennial censuses - 2000, 2010 (and soon 2020)
get_acs: fetch 2005 - 2020 data from the 1-year and 5-year
ACS (American Community Survey) databases
Let’s get started!
Copy and paste your Census API key from your email
Use the tidycensus function census_api_key to register your API key with tidycensus. Don’t forget to put quotes around the key!
# Install your census api key - long alphanumeric string
census_api_key("THE_BIG_LONG_ALPHANUMERIC_API_KEY_YOU_GOT_FROM_CENSUS")
Another way to add your Census API Key:
I keep my key in a file so no one can see it. One way to do this is
to write a script that assigns the key to a variable, and then use the
source function to run that script, which creates the variable in your
coding environment. The code chunk below is an example of how you might
do that:
# source (run) an r script that creates a variable with my key
source("../../keys/census_api_key.R")
#print(my_census_api_key)
# register the key
census_api_key(key = my_census_api_key)
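A third option, sketched here on the assumption that you have stored the key in an environment variable named CENSUS_API_KEY (for example, in your .Renviron file), is to read it with base R's Sys.getenv. The helper name read_census_key is hypothetical and exists only for this illustration.

```r
# Hypothetical helper (not part of tidycensus): read the key from an
# environment variable so it never appears in a script you share.
read_census_key <- function(var = "CENSUS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.")
  }
  key
}

# Usage (uncomment once the environment variable is set):
# census_api_key(read_census_key())
```

The helper fails loudly when the variable is missing, which is easier to debug than a rejected API request later on.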
The get_decennial function:
We start by fetching from the 2010 Census with tidycensus’s
get_decennial function. Let’s first talk about the
code.
pop2010 <- get_decennial(geography = "state", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010) # census year
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
head(pop2010)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama P001001 4779736
## 2 02 Alaska P001001 710231
## 3 04 Arizona P001001 6392017
## 4 05 Arkansas P001001 2915918
## 5 06 California P001001 37253956
## 6 22 Louisiana P001001 4533372
tidycensus data is
tidy
By default, tidycensus returns data in a tidy, or
long format that allows data for multiple variables to be
contained within the variable and value
columns. This is in contrast to untidy, or wide data where
each variable is in its own column.
tidycensus can return wide data if you add the parameter
output="wide" to the function call.
# wide format
pop2010w <- get_decennial(geography = "state", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010, # census year
output="wide") # get output in wide format
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
head(pop2010w)
## # A tibble: 6 x 3
## GEOID NAME P001001
## <chr> <chr> <dbl>
## 1 01 Alabama 4779736
## 2 02 Alaska 710231
## 3 04 Arizona 6392017
## 4 05 Arkansas 2915918
## 5 06 California 37253956
## 6 22 Louisiana 4533372
GEOID column
The GEOID column is included in tidycensus output.
This is a Census geographic identifier for the tabulation unit.
The GEOID is sometimes called the Census
FIPS code and for most tabulation units these are the
same.
The GEOID is a text string and must be quoted.
What is the GEOID for California?
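Here is a small sketch of why the quotes matter. It uses a hand-built tibble with three rows copied from the state output above (pop_sample is a name made up for this illustration) rather than a live API call.

```r
library(tidyverse)

# Three rows copied from the state output above
pop_sample <- tibble(
  GEOID = c("01", "02", "06"),
  NAME  = c("Alabama", "Alaska", "California"),
  value = c(4779736, 710231, 37253956)
)

# Compare GEOID against a quoted string
ca <- filter(pop_sample, GEOID == "06")
ca$NAME

# Comparing against a number finds nothing, because "06" is not "6"
no_match <- filter(pop_sample, GEOID == 6)
nrow(no_match)

# Converting to a number silently drops the leading zero
as.character(as.numeric("06"))
```

Keeping GEOIDs as character strings preserves the leading zeros that states like California depend on.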
Decennial census data is gathered from individuals and publicly distributed in aggregated form to protect privacy.
The Census tabulation units are the Census geographies to which the census data have been aggregated.
Some of the most common geographic tabulation units and their tidycensus function abbreviations are shown below, along with required and available filters that limit what data are returned.
| Geography | Definition | Filter(s) | Used in tidycensus |
|---|---|---|---|
| “us” | United States | | get_acs(), get_decennial() |
| “region” | Census region | | get_acs(), get_decennial() |
| “state” | State or equivalent | state | get_acs(), get_decennial() |
| “county” | County or equivalent | state, county | get_acs(), get_decennial() |
| “place” | Census place | state | get_acs(), get_decennial() |
| “tract” | Census tract | state, county | get_acs(), get_decennial() |
| “block group” | Census block group | state, county | get_acs(), get_decennial() |
| “block” | Census block | state, county | get_decennial() only! |
get_decennial Geographic Tabulation Units and Filters
Let’s work together to fill in the code to fetch state population in
2010 just for California. You can find the code in the file
Solutions.R.
See ?get_decennial for help.
Open Census-Data-in-R-Challenges.Rmd and use the
get_decennial function like we’ve seen above, but fill in the code arguments to fetch state population in 2010 just for California. Solutions are available in the Solutions folder, as needed.
Let’s fetch 2010 population data for CA counties
What changes in the code?
get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010, # census year
state='CA') # Filter by state is CA
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## # A tibble: 58 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06011 Colusa County, California P001001 21419
## 2 06007 Butte County, California P001001 220000
## 3 06001 Alameda County, California P001001 1510271
## 4 06003 Alpine County, California P001001 1175
## 5 06005 Amador County, California P001001 38091
## 6 06009 Calaveras County, California P001001 45578
## 7 06013 Contra Costa County, California P001001 1049025
## 8 06015 Del Norte County, California P001001 28610
## 9 06031 Kings County, California P001001 152982
## 10 06021 Glenn County, California P001001 28122
## # … with 48 more rows
state= filter?
You can also filter tidycensus results by county.
get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010, # census year
state='CA', # Filter by state is CA
county='Alameda') # Filter by county Alameda
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## # A tibble: 1 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06001 Alameda County, California P001001 1510271
In Census-Data-in-R-Challenges.Rmd, alter the code above to fetch 2010 population for Alameda & San Francisco Counties.
We can visualize the data to get a quick overview of the distribution of data values.
It’s a first step in exploratory data analysis and a last step in data communication.
ggplot2 is the most commonly used R package for data
visualization. It is loaded as part of the tidyverse package.
Let’s use it to visualize the population data.
Use ggplot2 to create an ordered horizontal dot plot.
pop_plot <- ggplot(data=pop2010,
aes(x=value, y=reorder(NAME,value)) ) +
geom_point()
# display the plot
pop_plot
You can get real fancy with
ggplot if you like.
# create a plot.
pop_plot <- ggplot(data=pop2010,
# set aesthetic variables
aes(x=value/1000000, y=reorder(NAME,value)) ) +
# pick geometry
geom_bar(stat="identity") +
# add theme and titles.
theme_minimal() +
labs(title = "2010 US Population by State") +
xlab("Population (in Millions)") +
ylab("State")
# display the plot.
pop_plot
In the code above we fetched data for total population in 2010 using
the variable "P001001".
That is not an obvious variable name, so how do we get those identifiers?
We can use the tidycensus load_variables function for this.
load_variables function
Use load_variables to fetch all variables used in the
2010 census into a dataframe.
vars2010 <- load_variables(year=2010, # Year or end year for ACS-5yr
dataset = 'sf1', # 'sf1' for decennial census
cache = TRUE) # Save fetched data locally
# How large is the output
dim(vars2010)
## [1] 8959 3
# Take a look with head or View
head(vars2010)
## # A tibble: 6 x 3
## name label concept
## <chr> <chr> <chr>
## 1 H001001 Total HOUSING UNITS
## 2 H002001 Total URBAN AND RURAL
## 3 H002002 Total!!Urban URBAN AND RURAL
## 4 H002003 Total!!Urban!!Inside urbanized areas URBAN AND RURAL
## 5 H002004 Total!!Urban!!Inside urban clusters URBAN AND RURAL
## 6 H002005 Total!!Rural URBAN AND RURAL
Over 3,000 unique variables that describe population and housing characteristics
Organized in 333 Tables
https://www.census.gov/data/datasets/2010/dec/summary-file-1.html
We can sort and filter the vars2010 dataframe to find it.
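One way to search is with dplyr's filter and stringr's str_detect. The sketch below runs on a four-row stand-in copied from head(vars2010) above (vars_sample is a made-up name); with the real table you would filter vars2010 the same way.

```r
library(tidyverse)

# A few rows copied from head(vars2010) above; the real table has 8,959 rows
vars_sample <- tibble(
  name    = c("H001001", "H002001", "H002002", "H002005"),
  label   = c("Total", "Total", "Total!!Urban", "Total!!Rural"),
  concept = c("HOUSING UNITS", "URBAN AND RURAL",
              "URBAN AND RURAL", "URBAN AND RURAL")
)

# Case-insensitive search of the label column
rural_vars <- vars_sample %>%
  filter(str_detect(label, regex("rural", ignore_case = TRUE)))

rural_vars
```

Searching the concept column instead of label is often a quicker way to find the table you want, since every variable in a table shares one concept.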
What 2010 decennial census variable contains…
Median Age
Average Family Size
Number of occupied housing units
Answers are in Solutions.Rmd
Return to Census-Data-in-R-Challenges.Rmd and use the
get_decennial function to fetch the Avg Family Size variable by CA County in 2010, and save the result as a dataframe named ca_fam_size. Once you’ve done that, plot the dataframe with the ggplot call below. Hint: “P037001”
Repeat the previous challenge with data from the
2000 decennial census. Don’t assume variable names are the same across the 2000 and 2010 censuses.
Use load_variables to check!
Census tracts are the most commonly used census tabulation unit.
Let’s fetch population data with the census tabulation unit set to tract.
Because of the large number of census tracts, you MUST specify a state when requesting these data with tidycensus.
## Fetch population by **tract** for California.
ca_tract_pop2010 <- get_decennial(geography = "tract", # census tab unit
variables = "P001001", # var of interest
year = 2010, # census year
state='CA') # State filter
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# How many tracts in CA
dim(ca_tract_pop2010)
## [1] 8057 4
# take a look
head(ca_tract_pop2010)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06037434004 Census Tract 4340.04, Los Angeles County, Californ… P001001 2796
## 2 06037460000 Census Tract 4600, Los Angeles County, California P001001 4851
## 3 06037460200 Census Tract 4602, Los Angeles County, California P001001 5315
## 4 06037460301 Census Tract 4603.01, Los Angeles County, Californ… P001001 4638
## 5 06037460302 Census Tract 4603.02, Los Angeles County, Californ… P001001 4442
## 6 06037460401 Census Tract 4604.01, Los Angeles County, Californ… P001001 878
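The tract GEOIDs in this output are built by concatenation: a 2-digit state code, a 3-digit county code, and a 6-digit tract code. A minimal base R sketch, using the first GEOID shown above, pulls those pieces apart.

```r
# First tract GEOID from the output above
geoid <- "06037434004"

state_fips  <- substr(geoid, 1, 2)   # state code: California
county_fips <- substr(geoid, 3, 5)   # county code: Los Angeles County
tract_code  <- substr(geoid, 6, 11)  # tract code: Census Tract 4340.04

c(state = state_fips, county = county_fips, tract = tract_code)
```

This nesting is what lets you roll tract data up to counties by grouping on the first five characters of the GEOID.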
Census tract data can be quite large!
Fortunately, you can also limit the results to one or more counties.
tract_pop2010 <- get_decennial(geography = "tract", # census tabulation unit
variables = "P001001", # variable of interest
year = 2010, # census year - only one!
state="CA", # limit to California
county=c("Alameda","Contra Costa")) # & counties
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
dim(tract_pop2010)
## [1] 569 4
What three things are new here?
#urban and rural pop for 3 CA counties
ur_pop10 <- get_decennial(geography = "county", # census tabulation unit
variables = c(urban="P002002",rural="P002005"),
year = 2010,
summary_var = "P002001", # The denominator
state='CA',
county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
Multiple variables in one call: variables = c("P002002","P002005")
Named variables: variables = c(urban="P002002",rural="P002005")
A summary_var (a denominator - here, the total count of all people or households surveyed. Can be used for calculations like percent of total.): summary_var = "P002001"
# take a look at the results
ur_pop10
## # A tibble: 6 x 5
## GEOID NAME variable value summary_value
## <chr> <chr> <chr> <dbl> <dbl>
## 1 06045 Mendocino County, California urban 48110 87841
## 2 06055 Napa County, California urban 118194 136484
## 3 06097 Sonoma County, California urban 424102 483878
## 4 06045 Mendocino County, California rural 39731 87841
## 5 06055 Napa County, California rural 18290 136484
## 6 06097 Sonoma County, California rural 59776 483878
The summary_value column comes in handy when you want to
compute percent of total, for example:
# Calculate the percent of population that is Urban or Rural
ur_pop10 <- ur_pop10 %>%
mutate(pct = 100 * (value / summary_value))
# Take a look at the output.
ur_pop10
## # A tibble: 6 x 6
## GEOID NAME variable value summary_value pct
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 06045 Mendocino County, California urban 48110 87841 54.8
## 2 06055 Napa County, California urban 118194 136484 86.6
## 3 06097 Sonoma County, California urban 424102 483878 87.6
## 4 06045 Mendocino County, California rural 39731 87841 45.2
## 5 06055 Napa County, California rural 18290 136484 13.4
## 6 06097 Sonoma County, California rural 59776 483878 12.4
Plots give us compact visual summaries of the data.
## Plot it
myplot <- ggplot(data = ur_pop10,
mapping = aes(x = NAME, fill = variable,
y = ifelse(test = variable == "urban",
yes = -pct, no = pct))) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = abs, limits=c(-100,100)) +
labs(title="Urban & Rural Population in Wine Country",
x="County", y = " Percent of Population", fill="") +
coord_flip()
myplot
Don’t worry if you don’t get all the ggplot code now. It’s here for reference.
Your R skills can help you reformat the data and make it more usable.
Let’s fetch population data for 2010 and 2000 by state.
Then we will combine these into one data frame using the
dplyr::bind_rows function (dplyr is part of the tidyverse).
# Fetch 2000 population data by state
pop2000 <- get_decennial(geography = "state",
variables = c(pop2000="P001001"),
year = 2000)
## Getting data from the 2000 decennial Census
## Using Census Summary File 1
# Fetch 2010 population data by state
pop2010 <- get_decennial(geography = "state",
variables = c(pop2010="P001001"),
year = 2010)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# Use tidyverse `bind_rows` function to combine the data for these years
state_pop <- bind_rows(pop2000, pop2010)
# Take a look with head or View
head(state_pop)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama pop2000 4447100
## 2 02 Alaska pop2000 626932
## 3 04 Arizona pop2000 5130632
## 4 05 Arkansas pop2000 2673400
## 5 06 California pop2000 33871648
## 6 08 Colorado pop2000 4301261
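Once both years are stacked in one long dataframe, a natural next step is to compare them. The sketch below assumes a two-state stand-in built from values in the 2000 and 2010 outputs above (state_pop_sample is a made-up name); it reshapes the data with tidyr's pivot_wider and computes percent change.

```r
library(tidyverse)

# Two states copied from the 2000 and 2010 outputs above
state_pop_sample <- tibble(
  GEOID    = c("01", "02", "01", "02"),
  NAME     = c("Alabama", "Alaska", "Alabama", "Alaska"),
  variable = c("pop2000", "pop2000", "pop2010", "pop2010"),
  value    = c(4447100, 626932, 4779736, 710231)
)

# One row per state, one column per year, then percent change
state_change <- state_pop_sample %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  mutate(pct_change = 100 * (pop2010 - pop2000) / pop2000)

state_change
```

Naming the variables pop2000 and pop2010 in the get_decennial calls is what makes this work, since those names become the column names after pivoting.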
Any Questions?
Mapping Census Data with tidycensus
You can fetch census geographic data by adding the parameter
geometry=TRUE to tidycensus functions
Under the hood, tidycensus calls the tigris package
to fetch data from the Census Geographic Data APIs.
Only a subset of data available via tigris can be
accessed via tidycensus.
You can then use your favorite R mapping functions or libraries like
plot, ggplot, and tmap to make
maps.
Before fetching census geographic data, we need to set the option
tigris_use_cache to TRUE
Caching saves data locally. This greatly speeds things up if you fetch the same census geographic data repeatedly.
# Tigris options - used by tidycensus
# Cache retrieved geographic data locally
options(tigris_use_cache = TRUE)
We fetch the geospatial data by setting geometry=TRUE.
pop2010geo <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
year = 2010,
output="wide",
geometry=TRUE) # Fetch geometry data for mapping
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
Let’s take a minute to discuss the format of an sf
spatial object.
head(pop2010geo, 3)
## Simple feature collection with 3 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -90.41814 ymin: 41.23796 xmax: -66.9499 ymax: 48.19097
## Geodetic CRS: NAD83
## # A tibble: 3 x 4
## GEOID NAME pop10 geometry
## <chr> <chr> <dbl> <MULTIPOLYGON [°]>
## 1 23 Maine 1.33e6 (((-67.61976 44.51975, -67.61541 44.52197, -67.58774 …
## 2 25 Massachus… 6.55e6 (((-70.83204 41.6065, -70.82373 41.59857, -70.82092 4…
## 3 26 Michigan 9.88e6 (((-88.68443 48.11578, -88.67563 48.12044, -88.67639 …
The tidycensus package uses the R sf
package to manage geospatial data.
R sf objects include:
a dataframe with a geometry column with the name
geometry
a CRS (coordinate reference system), specified
by
For a deeper understanding of the sf package and its
functionality, we recommend
our R-Geospatial-Fundamentals workshop
All geospatial data are referenced to the surface of the earth with a CRS, or coordinate reference system. Anyone working with geospatial data will need to develop an understanding of CRSs.
Fortunately, many of us are familiar with longitude and latitude, which are geographic coordinates. But there are different versions of geographic CRSs. And there are also projected CRSs, which transform longitude and latitude to a two-dimensional surface for mapping & analysis.
All census geographic data use the NAD83 geographic CRS.
NAD83 stands for North American Datum of 1983. This CRS is
best for locations in North America.
Many geospatial operations require you transform data to a common CRS before conducting spatial analysis or mapping.
An in-depth discussion of CRSs is outside the scope of this workshop. See Geocomputation in R for more information.
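To make the geographic versus projected distinction concrete, here is a small sketch with the sf package. It builds a single point with illustrative coordinates near Berkeley, CA (not workshop data), tags it as NAD83 (EPSG:4269), and reprojects it to Web Mercator (EPSG:3857).

```r
library(sf)

# One point in longitude/latitude; the coordinates are illustrative only
pt_nad83 <- st_sf(
  name = "Berkeley",
  geometry = st_sfc(st_point(c(-122.27, 37.87)), crs = 4269)  # EPSG:4269 = NAD83
)

# The CRS travels with the object
st_crs(pt_nad83)$epsg
st_is_longlat(pt_nad83)    # is this a geographic (lon/lat) CRS?

# Reproject to Web Mercator (EPSG:3857), a projected CRS
pt_webmerc <- st_transform(pt_nad83, crs = 3857)
st_is_longlat(pt_webmerc)
```

st_transform changes the coordinates themselves, so use it (rather than just relabeling the CRS) whenever two layers need to line up.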
sf Spatial Objects
We can use the sf plot method to make a quick map of the geometry
stored in an sf spatial object.
# plot the geometry column data
plot(pop2010geo$geometry)
The vast geographic extent and non-contiguous nature of the USA makes it difficult to map.
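Fortunately, tidycensus includes a shift_geo parameter
to shift AK & HI to below Texas. As the warning in the output below notes, shift_geo is deprecated in favor of tigris::shift_geometry().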
pop2010geo_shifted <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
output="wide",
year = 2010,
geometry=TRUE,
shift_geo=TRUE)
## Warning: The `shift_geo` argument is deprecated and will be removed in a future
## release. We recommend using `tigris::shift_geometry()` instead.
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Using Census Summary File 1
## Please note: Alaska and Hawaii are being shifted and are not to scale.
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## Shift Happens!
plot(pop2010geo_shifted$geometry)
You can save any sf data object to a shapefile using
st_write
st_write(pop2010geo_shifted, here("data_out/usa_pop2010_shifted.shp"))
# Check to see if the data was written out to a shapefile
dir(here("data_out"))
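Re-running st_write on a file that already exists will usually stop with an error. The sketch below shows one way around that, using a toy sf object (illustrative only, standing in for pop2010geo_shifted) and a temporary GeoPackage file rather than the workshop's data_out folder.

```r
library(sf)

# Toy sf object standing in for pop2010geo_shifted (illustrative values only)
toy <- st_sf(
  NAME  = "Example",
  pop10 = 100,
  geometry = st_sfc(st_point(c(-122.27, 37.87)), crs = 4269)
)

# Write to a temporary GeoPackage; delete_dsn = TRUE replaces an existing file
out_file <- file.path(tempdir(), "toy_pop.gpkg")
st_write(toy, out_file, delete_dsn = TRUE, quiet = TRUE)

# Read it back to confirm the round trip
toy_back <- st_read(out_file, quiet = TRUE)
toy_back$NAME
```

The same delete_dsn argument works for shapefiles; the GeoPackage format is used here only because it stores everything in a single file.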
You can use the sf plot command to make a map that sets
the color of the geometry by the data values
# Name the column with the variable values to make
# a thematic map, also called a choropleth map.
plot(pop2010geo_shifted['pop10'])
ggplot2 Map
ggplot knows what to do with sf objects!
ggplot(pop2010geo_shifted, aes(fill = pop10)) +
geom_sf() # tells ggplot that geographic data are being plotted
Let’s make that map a little nicer to look at.
ggplot(pop2010geo_shifted, aes(fill = pop10)) +
geom_sf(color=NA) + # What does color=NA do
coord_sf(crs = 3857) + # Dynamically change the CRS
scale_fill_viridis_c(option = "viridis") # Change the color palette
# Try different options, e.g.
# plasma, magma, inferno, cividis
In your Census-Data-in-R-Challenges.Rmd file, create a map of CA Median Age by county in 2010. Solutions are in the Solutions.Rmd file.
We can fetch Census data and the geometry for more than one state or county with the same function call.
This is so much easier than any alternative approach!
It can be applied to any available geographic tabulation areas (eg states, counties, tracts, places).
Let’s try it with Census Tracts!
Fetch tract population and geometry data for Bay Area Counties.
bay_counties <- c("Alameda", "Contra Costa", "Marin", "San Francisco",
"Sonoma", "Napa","Solano", "San Mateo", "Santa Clara")
bayarea_pop10 <- get_decennial(geography = "tract",
variables = "P001001",
year = 2010,
state='CA',
county=bay_counties,
geometry=T)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# Quick map
plot(bayarea_pop10['value'])
Questions?
ACS Data with get_acs
ACS data contains the most recent information about the American population.
We can use the tidycensus function get_acs to
retrieve ACS data using code very similar to
get_decennial.
BUT the workflow is more complex because:
The ACS has a lot more tables and variables, and
The ACS contains sample data, so each ACS
variable that you retrieve with tidycensus will fetch both
an estimate of the value and a margin of
error.
The ACS has two primary data products - the ACS 1 year database and the 5 year database.
The ACS 3 year data product has been discontinued.
The ACS 1 year data is more current but has a larger margin of error and is not available for Census geographies with a population of < 65,000.
So the ACS 5-year data is the most commonly used data set.
Let’s use the load_variables function to get a dataframe
of all variables from the ACS 2015-2019 5-year dataset, then look for the variable for
median household income.
vars_acs2019 <- load_variables(year=2019, # end year of the 2015-2019 period
dataset = 'acs5', # the ACS data product
cache = T) # Save locally for future access
# how many variables?
dim(vars_acs2019)
## [1] 27040 3
# Take a look at the resultant dataframe
## What is the variable for median household income?
#View(vars_acs2019)
Let’s fetch the median household income data for Alameda County by Census Tract.
alco_mhhincome <- get_acs(geography='tract',
variables=c(median_hhincome = "B19013_001"),
year = 2019,
state='CA',
county='Alameda',
geometry=TRUE
)
## Getting data from the 2015-2019 5-year ACS
Take a look
head(alco_mhhincome)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.2887 ymin: 37.52248 xmax: -121.8779 ymax: 37.81562
## Geodetic CRS: NAD83
## GEOID NAME variable
## 1 06001442301 Census Tract 4423.01, Alameda County, California median_hhincome
## 2 06001437400 Census Tract 4374, Alameda County, California median_hhincome
## 3 06001437701 Census Tract 4377.01, Alameda County, California median_hhincome
## 4 06001402400 Census Tract 4024, Alameda County, California median_hhincome
## 5 06001402500 Census Tract 4025, Alameda County, California median_hhincome
## 6 06001450743 Census Tract 4507.43, Alameda County, California median_hhincome
## estimate moe geometry
## 1 110761 21966 MULTIPOLYGON (((-121.9701 3...
## 2 86210 9325 MULTIPOLYGON (((-122.0926 3...
## 3 64559 6732 MULTIPOLYGON (((-122.0747 3...
## 4 39913 8581 MULTIPOLYGON (((-122.284 37...
## 5 30000 12436 MULTIPOLYGON (((-122.2879 3...
## 6 128737 9289 MULTIPOLYGON (((-121.9066 3...
What is the variable?
plot(alco_mhhincome['estimate'])
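The moe column deserves a look before you trust the map. ACS margins of error are published at the 90 percent confidence level, so a rough sketch of how reliable each estimate is can be computed as below. The two rows are copied from the output above; inc_sample is a made-up name for this illustration.

```r
library(tidyverse)

# Two tracts copied from the output above
inc_sample <- tibble(
  GEOID    = c("06001442301", "06001402500"),
  estimate = c(110761, 30000),
  moe      = c(21966, 12436)
)

# A 90 percent MOE corresponds to 1.645 standard errors,
# so the standard error is moe / 1.645
inc_sample <- inc_sample %>%
  mutate(lower  = estimate - moe,
         upper  = estimate + moe,
         cv_pct = 100 * (moe / 1.645) / estimate)

inc_sample
```

The coefficient of variation (cv_pct) is much larger for the second tract, a reminder that tract-level estimates can be quite noisy.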
First identify the variables of interest.
# Median household income by race/ethnicity: Variables from ACS 2015—19
#All households = "B19013_001",
inc_by_race <- c(White = "B19013H_001",
Black = "B19013B_001",
Asian = "B19013D_001",
Hispanic = "B19013I_001" )
Fetch census tract data for multiple variables at once.
# Fetch the Data
alco_mhhinc_by_race <- get_acs(geography='tract',
variables=inc_by_race,
year = 2019,
state='CA',
county='Alameda',
geometry=T )
## Getting data from the 2015-2019 5-year ACS
Facet maps are a way to create visualizations of
small multiples, or subsets of the data in order to
facilitate comparisons. Here, we use ggplot’s facet_wrap
function to make multiple maps of median household income by race for
Alameda County.
# Create the map
medhhinc_facet_map <- alco_mhhinc_by_race %>%
ggplot(aes(fill = estimate)) +
facet_wrap(~variable) +
geom_sf(color=NA) +
scale_fill_viridis_c(option="magma")
# Display the map
medhhinc_facet_map
#
Make a ggplot map of MEDIAN GROSS RENT (DOLLARS) in San Francisco County by tract using data from the ACS 2015—2019 5-year product.
# Fill in the code to fetch the data - (Solutions.R has the code)
# Median household rent for San Francisco County
alcc_medrent <- get_acs(geography= ,
variables= ,
year = ,
state= ,
county= ,
geometry=)
In the Census-Data-in-R-Challenges.Rmd file, make a ggplot map of MEDIAN GROSS RENT in San Francisco County by tract using data from the ACS 2015-2019 5-year product. Check Solutions.Rmd for answers, as needed.
Interactive mapping gives the RStudio environment some of the functionality of desktop GIS.
There are a number of R packages that you can use, including:
mapview: quick interactive exploratory data viewing
tmap: great static and interactive maps
Leaflet: highly customizable interactive maps
All of these are based on the Leaflet JavaScript library.
Let’s use mapview to make some quick interactive maps of
our median household income data.
mapview(alco_mhhincome)
mapview(alco_mhhincome, zcol="estimate")
Leaflet is the ggplot2 of interactive mapping. Leaflet
in R follows a tidyverse convention, using pipes (%>%) to create
layers in the mapping object. We can use leaflet to create interactive
maps allowing for more flexibility in design and features we can create.
With added complexity in the code, of course!
# Create a color palette
pal <- colorNumeric(
palette = "YlOrRd",
domain = alco_mhhincome$estimate
)
# specify dataset
leaflet(alco_mhhincome) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
# adjust color palette and polygon features.
addPolygons(stroke = FALSE, smoothFactor = 0.2, fillOpacity = .5,
color = ~pal(estimate)) %>%
# add legend
addLegend(pal = pal, values = ~estimate,
title = "Median Income",
labFormat = labelFormat(prefix = "$"),
position = "bottomleft")
## Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
## Need '+proj=longlat +datum=WGS84'
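The warning above appears because census geometry is in NAD83 while leaflet expects WGS84. One way to avoid it, sketched here with a toy polygon (illustrative coordinates, standing in for alco_mhhincome), is to reproject with sf's st_transform before mapping.

```r
library(sf)

# Toy polygon in NAD83, standing in for alco_mhhincome (illustrative only)
toy_nad83 <- st_sf(
  estimate = 100000,
  geometry = st_sfc(
    st_polygon(list(rbind(c(-122.3, 37.8), c(-122.2, 37.8),
                          c(-122.2, 37.9), c(-122.3, 37.9),
                          c(-122.3, 37.8)))),
    crs = 4269
  )
)

# Reproject to WGS84 (EPSG:4326), the datum leaflet asks for
toy_wgs84 <- st_transform(toy_nad83, crs = 4326)
st_crs(toy_wgs84)$epsg

# With the workshop data this would look like:
# alco_mhhincome_wgs84 <- st_transform(alco_mhhincome, crs = 4326)
# leaflet(alco_mhhincome_wgs84) %>% addProviderTiles(providers$CartoDB.Positron)
```

For most of North America the two datums differ by very little, so the map will look the same; the transform mainly silences the warning and keeps layers consistent.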
In the Census-Data-in-R-Challenges.Rmd, use
mapview to create an interactive choropleth map of median household rent.
ACS variables can be confusing.
Some ways to identify the best variables to explore:
Web search, especially Census web resources, can help.
The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.
The NHGIS website (nhgis.org) is a great way to browse variables of interest.
We haven’t talked about margins of error (MOEs), but they may be important in your work with ACS data.
Math is needed to combine MOEs when you combine variables.
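As a rough sketch of that math: the usual approximation for the MOE of a sum of estimates is the square root of the sum of the squared MOEs. The function name moe_of_sum and the input values below are hypothetical, chosen only to show the arithmetic in base R.

```r
# Approximate MOE for a sum of estimates:
# square root of the sum of squared MOEs
moe_of_sum <- function(moe) {
  sqrt(sum(moe^2))
}

# Hypothetical MOEs for two counts being added together
combined_moe <- moe_of_sum(c(30, 40))
combined_moe
```

Note that the combined MOE (50) is smaller than simply adding the two MOEs (70). This is an approximation that treats the estimates as independent.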
tidycensus includes some nice functions
for these calculations and a good overview of the topic.
tidycensus offers two key functions for fetching census
tabular and geographic data: get_acs and
get_decennial.
Support for fetching population estimates and
migration flow census data was recently added to
tidycensus. You can read up on it on the tidycensus
documentation website
Using tidycensus to fetch the tabular data or both
tabular and geographic data is IMO way easier than any alternatives,
IF you (1) know R, (2) know a bit about working with
geographic data in R.
This approach is also scaleable if you want multiple census variables for various locations and tabulation areas.
You can greatly enhance your maps if you make them with
ggplot2 rather than the default plot
command.
Interactive mapping greatly enhances your ability to do exploratory data analysis in RStudio.
Much of this tutorial is based on resources by Kyle Walker, author of
tidycensus. See:
Related D-Lab Workshops
Great online resource for working with spatial data in R
Let’s use the 2010 census data to map the percent of San Francisco (SF) properties that were rented.
To start, identify the variables for the
Total number of housing units
Number of renter occupied units
sf_rented <- get_decennial(geography = , # census tabulation unit
variables = , # number of households rented
year = ,
summary_var = , # Total households
state=,
county=,
geometry=)
And here it is: SF Percent Rented Units, 2010
sf_rented <- get_decennial(geography = "tract", # census tabulation unit
variables = "H004004", #number of households rented
year = 2010,
summary_var = "H004001", # Total households
state='CA',
county='San Francisco',
geometry=T)
# take a look at the output
head(sf_rented)
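# Keep only tracts with at least one rented unit; this also drops tracts
# with zero households, which avoids dividing by zero below
sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
mutate(pct = 100 * (value / summary_value))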
# Take a look
head(sf_pct_rented)
plot(sf_pct_rented['pct'])